Abstract

We develop a nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. The 
nHDP is a generalization of the nested Chinese restaurant process (nCRP) that allows each word to 
follow its own path to a topic node according to a document-specific distribution on a shared tree. This 
alleviates the rigid, single-path formulation of the nCRP, allowing a document to more easily express 
thematic borrowings as a random effect. We derive a stochastic variational inference algorithm for the 
model, in addition to a greedy subtree selection method for each document, which allows for efficient 
inference using massive collections of text documents. We demonstrate our algorithm on 1.8 million 
documents from The New York Times and 3.3 million documents from Wikipedia. 

Index Terms 

Bayesian nonparametrics, Dirichlet process, topic modeling, stochastic inference 

I. Introduction 

Organizing things hierarchically is a natural process of human activity. Walking into a large department 
store, one might first find the men's section, followed by men's casual, and then see the t-shirts hanging 
along the wall. Or one may be in the mood for Italian food, decide whether to spring for the better, more 
authentic version or go to one of the cheaper chain options, and then end up at the Olive Garden. Similarly 
with data analysis, a hierarchical tree-structured representation of the data can provide an illuminating 
means for understanding and reasoning about the information it contains. 

The nested Chinese restaurant process (nCRP) [1] is a model that performs this task for the problem
of topic modeling. Hierarchical topic models place a structured prior on the topics underlying a corpus
of documents, with the aim of bringing more order to an unstructured set of thematic concepts [1][2][3].
They do this by learning a tree structure for the underlying topics, with the inferential goal being that
topics closer to the root are more general, and gradually become more specific in thematic content when
following a path down the tree.

Fig. 1. An example of path structures for the nested Chinese restaurant process (nCRP) and the nested hierarchical Dirichlet
process (nHDP) for hierarchical topic modeling. With the nCRP, the topics for a document are restricted to lying along a single
path to a root node. With the nHDP, each document has access to the entire tree, but a document-specific distribution on paths
will place high probability on a particular subtree. The goal of the nHDP is to learn a thematically consistent tree as achieved
by the nCRP, while allowing for the cross-thematic borrowings that naturally occur within a document.

The nCRP is a Bayesian nonparametric prior for hierarchical topic models, but is limited in the 
hierarchies it can model. We illustrate this limitation in Figure 1. The nCRP models the topics that go
into constructing a document as lying along one path of the tree. From a practical standpoint this is
a disadvantage, since inference in trees over three levels is computationally hard [2][1], and hence in
practice each document is limited to only three underlying topics. Moreover, this is also a significant
disadvantage from a modeling standpoint. 

As a simple example, consider a document on ESPN.com about an injured player, compared with an 
article in a sports medicine journal. Both documents will contain words about medicine and words about 
sports. Should the nCRP select a path transitioning from sports to medicine, or vice versa? Depending 
on the article, both options are reasonable, and during the learning process the model will either acquire 
both paths, hence partitioning sports and medicine words among multiple topics, or choose one over the
other, which will require all documents containing the topic from the lower level to at least have the higher
level topic activated. In one case the model is not using the full statistical power within the corpus to 
model each topic and in the other the model is learning an unreasonable tree. Returning to the practical 
aspect, for trees truncated to a small number of levels, there simply is not enough room to learn all of 
these combinations. 

Though the nCRP is a Bayesian nonparametric prior, it performs nonparametric clustering of document- 
specific paths, which fixes the number of available topics to a small number for trees of a few levels. Our 
goal is to develop a related Bayesian nonparametric prior that performs word-specific path clustering. 
We illustrate this objective in Figure 1. In this case, each word has access to the entire tree, but with
document-specific distributions on paths within the tree. To this end, we make use of the hierarchical 
Dirichlet process [4], developing a novel prior that we refer to as the nested hierarchical Dirichlet process 
(nHDP). The HDP can be viewed as a nonparametric elaboration of the classical topic model, the latent 
Dirichlet allocation (LDA) model [5], providing a mechanism whereby a top-level Dirichlet process 
provides a base distribution for a collection of second-level Dirichlet processes, one for each document. 
With the nHDP, a top-level nCRP becomes a base distribution for a collection of second-level nCRPs, 
one for each document. The nested HDP provides the opportunity for cross-thematic borrowing that is 
not possible with the nCRP. 

Hierarchical topic models have thus far been applied to corpora of small size. A significant issue, not 
just with topic models but with Bayesian models in general, is scaling up inference to massive data sets 
[6]. Recent developments in stochastic variational inference methods have done this for LDA and the 
HDP topic model [7][8][9]. We continue this development for hierarchical topic modeling with the nested 
HDP. Using stochastic VB, in which we maximize the variational objective using stochastic optimization, 
we demonstrate the ability to efficiently handle very large corpora. This is a major benefit to complex 
models such as tree-structured topic models, which require significant amounts of data to support their 
exponential growth in size. 

We organize the paper as follows: In Section II we review the Bayesian nonparametric priors that
we incorporate in our model: the Dirichlet process, nested Chinese restaurant process and hierarchical
Dirichlet process. In Section III we present our proposed nested HDP model for hierarchical topic
modeling. In Section IV we review stochastic variational inference and present an inference algorithm for
nHDPs that scales well to massive data sets. We present empirical results in Section V. We first compare
the nHDP with the nCRP on three relatively small data sets. We then evaluate our stochastic algorithm on
1.8 million documents from The New York Times and 3.3 million documents from Wikipedia, comparing
performance with stochastic LDA and stochastic HDP.

II. Background: Bayesian nonparametric priors for topic models 

The nested hierarchical Dirichlet process (nHDP) builds on a collection of existing Bayesian nonpara- 
metric priors. In this section, we provide a review of these priors: the Dirichlet process, nested Chinese 
restaurant process and hierarchical Dirichlet process. We also review constructive representations for these 
processes that we will use for posterior inference of the nHDP topic model. 

A. Dirichlet processes 

The Dirichlet process (DP) [10] is the foundation for a large collection of Bayesian nonparametric
models that rely on mixtures to statistically represent data. Mixture models work by partitioning a data
set according to statistical traits shared by members of the same cell. Dirichlet process priors are effective
in learning the number of these traits, in addition to the parameters of the mixture. The basic form
of a Dirichlet process mixture model is

$$W_n \mid \varphi_n \sim F_W(\varphi_n), \qquad \varphi_n \mid G \overset{iid}{\sim} G, \qquad G = \sum_{i=1}^{\infty} p_i \delta_{\theta_i}. \tag{1}$$

With this representation, data $W_1, \dots, W_N$ are distributed according to a family of distributions $F_W$ with
respective parameters $\varphi_1, \dots, \varphi_N$. These parameters are drawn from the distribution $G$, which is discrete
and potentially infinite, as the DP allows it to be. This discreteness induces a partition of the data $W$
according to the sharing of the atoms $\{\theta_i\}$ among the parameter selections $\{\varphi_n\}$.

The Dirichlet process is a stochastic process on random elements $G$. To briefly review, let $(\Theta, \mathcal{B})$ be
a measurable space, $G_0$ a probability measure on it and $\alpha > 0$. Ferguson proved the existence of a
stochastic process $G$ where, for all partitions $\{B_1, \dots, B_k\}$ of $\Theta$,

$$(G(B_1), \dots, G(B_k)) \sim \mathrm{Dirichlet}(\alpha G_0(B_1), \dots, \alpha G_0(B_k)),$$

abbreviated as $G \sim \mathrm{DP}(\alpha G_0)$. It has been shown that $G$ is discrete (with probability one) even when $G_0$
is non-atomic [11][12], though the probability that the random variable $G(B_k)$ is less than $\epsilon$ increases
to 1 as $B_k$ decreases to a point for every $\epsilon > 0$. Thus the DP prior is a good candidate for $G$ in (1)
since it generates discrete distributions on infinitely large parameter spaces. For most applications $G_0$ is
continuous, and so representations of $G$ at the granularity of the atoms are necessary for inference; we
next review two approaches to working with this infinite-dimensional distribution.
1) Chinese restaurant process: The Chinese restaurant process (CRP) avoids directly working with $G$
by integrating it out [11][13]. In doing so, the values of $\varphi_1, \dots, \varphi_N$ become dependent, with the value
of $\varphi_{n+1}$ given $\varphi_1, \dots, \varphi_n$ distributed as

$$\varphi_{n+1} \mid \varphi_1, \dots, \varphi_n \sim \sum_{i=1}^{n} \frac{1}{\alpha + n} \delta_{\varphi_i} + \frac{\alpha}{\alpha + n} G_0. \tag{2}$$

That is, $\varphi_{n+1}$ takes the value of one of the previously observed $\varphi_i$ with probability $\frac{n}{\alpha + n}$ and a value
drawn from $G_0$ with probability $\frac{\alpha}{\alpha + n}$, which will be unique when $G_0$ is continuous. This displays the
clustering property of the CRP and also gives insight into the impact of $\alpha$, since it is evident that the
number of unique $\varphi_i$ grows like $\alpha \ln(\alpha + n)$. In the limit $n \to \infty$, the distribution in (2) converges to
a random measure distributed according to a Dirichlet process [11]. The CRP is so-called because of
an analogy to a Chinese restaurant, where a customer (datum) sits at a table (selects a parameter) with
probability proportional to the number of previous customers at that table, or selects a new table with
probability proportional to $\alpha$.
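The seating rule of the CRP can be sketched in a few lines. The following is a minimal illustration rather than part of our inference algorithm; the function name `sample_crp` and its arguments are hypothetical, and only table indices are tracked (the parameter attached to each table, a draw from the base distribution, is omitted).

```python
import random

def sample_crp(n, alpha, seed=0):
    """Seat n customers sequentially according to a CRP with concentration alpha.

    Returns the table index chosen by each customer and the final table counts.
    (Hypothetical helper for illustration; atoms drawn from G_0 are omitted.)
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of customers already at table k
    assignments = []
    for i in range(n):                 # i customers are already seated
        # Existing table k has weight counts[k]; a new table has weight alpha.
        r = rng.uniform(0.0, i + alpha)
        acc = 0.0
        table = len(counts)            # default: open a new table
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                table = k
                break
        if table == len(counts):
            counts.append(0)
        counts[table] += 1
        assignments.append(table)
    return assignments, counts

if __name__ == "__main__":
    _, counts = sample_crp(1000, alpha=2.0)
    print("number of occupied tables:", len(counts))
```

Running the sketch for increasing `n` gives a feel for the slow, logarithmic growth in the number of occupied tables.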

2) Stick-breaking construction: Where the Chinese restaurant process works with $G \sim \mathrm{DP}(\alpha G_0)$
implicitly through $\varphi$, a stick-breaking construction allows one to directly construct $G$ before drawing any
$\varphi_n$. Sethuraman [12] showed that if $G$ is constructed as follows:

$$G = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1} (1 - V_j) \delta_{\theta_i}, \qquad V_i \overset{iid}{\sim} \mathrm{Beta}(1, \alpha), \qquad \theta_i \overset{iid}{\sim} G_0, \tag{3}$$

then $G \sim \mathrm{DP}(\alpha G_0)$. The variable $V_i$ can be interpreted as the proportion broken from the remainder of a
unit length stick, $\prod_{j<i} (1 - V_j)$. As the index $i$ increases, more random variables in $[0, 1]$ are multiplied,
and thus the weights exponentially decrease to zero; the expectation $\mathbb{E}[V_i \prod_{j<i} (1 - V_j)] = \frac{\alpha^{i-1}}{(1+\alpha)^i}$ gives
a sense of the impact of $\alpha$ on these weights. This explicit construction of $G$ maintains the independence
among $\varphi_1, \dots, \varphi_N$ as written in Equation (1), which is a significant advantage of this representation for
mean-field variational inference that is not present in the CRP.
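As a hedged illustration of this construction, the sketch below draws the first few stick-breaking weights and compares a Monte Carlo average of the first weight with the closed-form expectation above. The truncation level and the helper names `stick_breaking_weights` and `expected_weight` are assumptions made for the example; the true construction is infinite.

```python
import random

def stick_breaking_weights(alpha, truncation, rng):
    """Draw the first `truncation` stick-breaking weights of a DP(alpha G_0).

    Returns the weights and the stick mass left unassigned by the truncation.
    """
    weights = []
    remaining = 1.0                      # length of the stick not yet broken off
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)  # V_i ~ Beta(1, alpha)
        weights.append(v * remaining)    # V_i * prod_{j<i} (1 - V_j)
        remaining *= 1.0 - v
    return weights, remaining

def expected_weight(alpha, i):
    """Closed-form E[V_i prod_{j<i} (1 - V_j)] for the 1-indexed break i."""
    return alpha ** (i - 1) / (1.0 + alpha) ** i

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [stick_breaking_weights(2.0, 5, rng)[0][0] for _ in range(20000)]
    print("Monte Carlo mean of first weight:", sum(draws) / len(draws))
    print("closed form:", expected_weight(2.0, 1))
```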

B. Nested Chinese restaurant processes 

Nested Chinese restaurant processes (nCRP) are a tree-structured extension of the CRP that are useful 
for hierarchical topic modeling [1]. They extend the CRP analogy to a nesting of restaurants in the
following way: After selecting a table (parameter) according to a CRP, the customer departs for another 
restaurant only indicated by that table. Upon arrival, the customer again acts according to the CRP for 
the new restaurant, and again departs for a restaurant only accessible through the table selected. This
occurs for a potentially infinite sequence of restaurants, which generates a sequence of parameters for
the customer according to the selected tables. 

A natural interpretation of the nCRP is as a tree where each parent has an infinite number of children. 
Starting from the root node, a path is traversed down the tree. Given the current node, a child node is 
selected with probability proportional to the previous number of times it was selected among its siblings, 
or a new child is selected with probability proportional to a. As with the CRP, the nCRP also has a 
constructive representation useful for variational inference which we now discuss. 

1) Constructing the nCRP: The nesting of Dirichlet processes that leads to the nCRP gives rise to a
stick-breaking construction [2].¹ We develop the notation for this construction here and use it later in our
construction of the nested HDP. Let $\mathbf{i}_l = (i_1, \dots, i_l)$ be a path to a node at level $l$ of the tree.² According to
the stick-breaking version of the nCRP, the children of node $\mathbf{i}_l$ are countably infinite, with the probability
of transitioning to child $j$ equal to the $j$th break of a stick-breaking construction. Each child corresponds
to a parameter drawn independently from $G_0$. Letting the index of the parameter identify the index of
the child, this results in the following DP for the children of node $\mathbf{i}_l$,

$$G_{\mathbf{i}_l} = \sum_{j=1}^{\infty} V_{\mathbf{i}_l,j} \prod_{m=1}^{j-1} (1 - V_{\mathbf{i}_l,m}) \delta_{\theta_{(\mathbf{i}_l,j)}}, \qquad V_{\mathbf{i}_l,j} \overset{iid}{\sim} \mathrm{Beta}(1, \alpha), \qquad \theta_{(\mathbf{i}_l,j)} \overset{iid}{\sim} G_0. \tag{4}$$

If the next node is child $j$, then the nCRP transitions to DP $G_{\mathbf{i}_{l+1}}$, where $\mathbf{i}_{l+1}$ has index $j$ appended to
$\mathbf{i}_l$, that is $\mathbf{i}_{l+1} = (\mathbf{i}_l, j)$. A sequence of parameters $\varphi = (\varphi_1, \varphi_2, \dots)$ generated from a path down this
tree follows a Markov chain, where the parameter $\varphi_l$ corresponds to an atom $\theta_{\mathbf{i}_l}$ at level $l$ and the stick-
breaking weights correspond to the transition probabilities. Hierarchical topic models use these sequences
of parameters as topics for generating documents.
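The path-sampling rule can be sketched with a lazily instantiated tree, in which stick proportions are drawn only when a path first reaches them. This is an illustrative sketch under our own assumptions: the class `Node` and the functions `sample_child` and `sample_path` are hypothetical names, the atoms attached to the nodes are omitted, and the path is cut off at a fixed depth.

```python
import random

class Node:
    """One node of the tree; holds the stick proportions over its children."""
    def __init__(self):
        self.sticks = []     # V_{i,j} for children j = 0, 1, ..., drawn lazily
        self.children = []   # children[j] is the node reached through break j

def sample_child(node, alpha, rng):
    """Pick a child index from the stick-breaking distribution at `node`.

    Walks along the breaks and stops at break j with probability V_{i,j},
    which selects child j with probability V_{i,j} * prod_{m<j} (1 - V_{i,m}).
    """
    j = 0
    while True:
        if j == len(node.sticks):        # first visit: instantiate this break
            node.sticks.append(rng.betavariate(1.0, alpha))
            node.children.append(Node())
        if rng.random() < node.sticks[j]:
            return j
        j += 1

def sample_path(root, depth, alpha, rng):
    """Follow the Markov chain from the root for `depth` levels."""
    path, node = [], root
    for _ in range(depth):
        j = sample_child(node, alpha, rng)
        path.append(j)
        node = node.children[j]
    return tuple(path)

if __name__ == "__main__":
    rng = random.Random(0)
    root = Node()
    print([sample_path(root, 3, 1.0, rng) for _ in range(5)])
```

Because all paths share the same `root`, paths sampled later reuse the proportions instantiated earlier, which is what produces the shared tree.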

2) Nested CRP topic models: Hierarchical topic models based on the nested CRP use a globally shared
tree to generate a corpus of documents. Starting with the construction of nested Dirichlet processes as
described above, each document selects a path down the tree according to a Markov process, which
produces a sequence of topics $\varphi_d = (\varphi_{d,1}, \varphi_{d,2}, \dots)$ used to generate the document. As with other topic
models, each word in a document is represented by an index $W_{d,n} \in \{1, \dots, \mathcal{V}\}$ and the atoms $\theta_{\mathbf{i}_l}$
appearing in $\varphi_d$ are $\mathcal{V}$-dimensional probability vectors with prior $G_0$ a Dirichlet distribution.

For each document $d$, a new stick-breaking process provides a distribution on the topics in $\varphi_d$,

$$G^{(d)} = \sum_{j=1}^{\infty} U_{d,j} \prod_{m=1}^{j-1} (1 - U_{d,m}) \delta_{\varphi_{d,j}}, \qquad U_{d,j} \overset{iid}{\sim} \mathrm{Beta}(\gamma_1, \gamma_2). \tag{5}$$

Following the standard method, words for document $d$ are generated by first drawing a parameter i.i.d.
from $G^{(d)}$, and then drawing the word index from the discrete distribution with the selected parameter.

¹ The "nested Dirichlet process" that we present here was first described (using random measures rather than the stick-breaking
construction) by [14], who developed it for a two-level tree.

² That is, from the root node first select the child with index $i_1$; from node $\mathbf{i}_1 = (i_1)$ select the child with index $i_2$; from
node $\mathbf{i}_2 = (i_1, i_2)$ select the child with index $i_3$, and so on to level $l$. We ignore the root $i_0$, which is shared by all paths.

3) Issues with the nCRP: As discussed in the introduction, a significant drawback of the nCRP for
topic modeling is that each document follows one path down the tree. Therefore, all thematic content of a
document must be contained within that single sequence of topics. Since the nCRP is meant to characterize
the thematic content of a corpus in increasing levels of specificity, this creates a combinatorial problem,
where similar topics will appear in many parts of the tree to account for the possibility that they appear
as a random effect in a document. In practice, nCRP trees are typically truncated at three levels [2][1],
since learning deeper levels becomes difficult due to the exponential increase in nodes.³ In this situation
each document has three topics for modeling its entire thematic content, and so a blending of multiple
topics is likely to occur during inference.

The nCRP is a BNP prior, but it performs nonparametric clustering of the paths selected at the document
level, rather than at the word level. Though the same tree is shared by a corpus, each document can
differentiate itself by the path it chooses. The key issue with the nCRP is the restrictiveness of this single
path allowed to a document. If instead each word were allowed to follow its own path according to an
nCRP, this characteristic would be lost: only a tree level distribution similar to Equation (5) could
distinguish one document from another, and thematic coherence would be missing. Our goal is to develop
a hierarchical topic model that does not prohibit a document from using topics in different parts of the
tree. Our solution to this problem is to employ the hierarchical Dirichlet process (HDP) [4].

C. Hierarchical Dirichlet processes 

The HDP is a multi-level version of the Dirichlet process. It makes use of the idea that the base
distribution on the infinite space $\Theta$ can be discrete, and indeed a discrete distribution allows for multiple
draws from the DP prior to place probability mass on the same subset of atoms. Hence different groups
of data can share the same atoms, but place different probability distributions on them. A discrete base
is needed, but the atoms are unknown in advance. The HDP achieves this by drawing the base from a
DP prior. This leads to the hierarchical process

$$G_d \mid G \sim \mathrm{DP}(\beta G), \qquad G \sim \mathrm{DP}(\alpha G_0), \tag{6}$$

for groups $d = 1, \dots, D$. This prior has been used to great effect in topic modeling as a nonparametric
extension of LDA and related LDA-based models [15][16][17].

³ This includes a root node topic, which is shared by all documents and is intended to collect stop words.

As with the DP, concrete representations of the HDP are necessary for inference. The representation we
use relies on two levels of Sethuraman's stick-breaking construction. For this construction, after sampling
$G$ as in Equation (3), we sample $G_d$ in the same way,

$$G_d = \sum_{i=1}^{\infty} V_i^d \prod_{j=1}^{i-1} (1 - V_j^d) \delta_{\phi_i}, \qquad V_i^d \overset{iid}{\sim} \mathrm{Beta}(1, \beta), \qquad \phi_i \overset{iid}{\sim} G. \tag{7}$$

This form is identical to Equation (3), with the key difference that $G$ is discrete, and so atoms $\phi_i$ will
repeat. An advantage of this representation is that all random variables are i.i.d., with significant benefits
to variational inference strategies.
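A truncated version of this two-level construction can be sketched as follows. The truncation levels, the choice to assign the leftover stick mass to the last break, and the names `stick_weights` and `sample_hdp_group` are assumptions made for illustration. The sketch shows how second-level breaks that point to the same top-level atom pool their weight, so that each group places a different distribution on a shared set of atoms.

```python
import random

def stick_weights(conc, truncation, rng):
    """Truncated stick-breaking weights with Beta(1, conc) proportions.

    The final break takes all remaining mass so the weights sum to one
    (a truncation assumption made for this sketch).
    """
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, conc)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights

def sample_hdp_group(top_weights, beta, truncation, rng):
    """Draw one group-level distribution over the top-level atoms."""
    sticks = stick_weights(beta, truncation, rng)
    # Each second-level atom is an i.i.d. draw from the discrete top-level G,
    # so it is represented here by the index of a top-level atom.
    pointers = rng.choices(range(len(top_weights)), weights=top_weights, k=truncation)
    mass = [0.0] * len(top_weights)
    for w, k in zip(sticks, pointers):
        mass[k] += w          # repeated atoms pool their stick weights
    return mass

if __name__ == "__main__":
    rng = random.Random(0)
    top = stick_weights(1.0, 10, rng)
    for d in range(3):
        group = sample_hdp_group(top, 0.5, 20, rng)
        print([round(m, 3) for m in group])
```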

III. Nested hierarchical Dirichlet processes for topic modeling 

In building on the nCRP framework, our goal is to allow for each document to have access to the entire 
tree, while still learning document-specific distributions on topics that are thematically coherent. Ideally, 
each document will still exhibit a dominant path corresponding to its main themes, but with offshoots 
allowing for random effects. Our two major changes to the nCRP formulation toward this end are that 
(i) each word follows its own path to a topic, and (ii) each document has its own distribution on paths 
in a shared tree. The BNP tools discussed above make this a straightforward task. 

We split the process of generating a document's distribution on topics into two parts: generating a 
document's distribution on paths down the tree, and generating a word's distribution on terminating at a 
particular node within those paths. 

A. Constructing the tree for a document 

With the nHDP, all documents share a global nCRP constructed with a stick-breaking construction as
in Section II-B1. Denote this tree by $T$. As discussed, $T$ is simply an infinite collection of Dirichlet
processes with a continuous base distribution $G_0$ and a transition rule between DPs. According to this
rule, from a root Dirichlet process $G_{i_0}$, a path is followed by drawing $\varphi_{l+1} \sim G_{\mathbf{i}_l}$ for $l = 0, 1, 2, \dots$,
where $i_0$ is a constant root index that we ignore, and $\mathbf{i}_l = (i_1, \dots, i_l)$ indexes the current DP associated
with $\varphi_l = \theta_{\mathbf{i}_l}$. With the nested HDP, we do not perform this path selection on the top-level $T$, but instead
use each Dirichlet process in $T$ as a base for a second-level DP drawn independently for each document.
That is, for document $d$ we construct a tree $T_d$, where for each $G_{\mathbf{i}_l} \in T$, we draw a corresponding
$G^{(d)}_{\mathbf{i}_l} \in T_d$ independently in $d$ according to a second-level Dirichlet process

$$G^{(d)}_{\mathbf{i}_l} \sim \mathrm{DP}(\beta G_{\mathbf{i}_l}). \tag{8}$$

As discussed in Section II-C, $G^{(d)}_{\mathbf{i}_l}$ will have the same atoms as $G_{\mathbf{i}_l}$, but with different probability weights
on them. Therefore, the tree $T_d$ will have the same nodes as $T$, but the probability of a path in $T_d$ will
vary with $d$, giving each document its own distribution on a shared tree.



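We represent this second-level DP with a stick-breaking construction as in Section II-C,

$$G^{(d)}_{\mathbf{i}_l} = \sum_{j=1}^{\infty} V^{(d)}_{\mathbf{i}_l,j} \prod_{m=1}^{j-1} (1 - V^{(d)}_{\mathbf{i}_l,m}) \delta_{\phi^{(d)}_{\mathbf{i}_l,j}}, \qquad V^{(d)}_{\mathbf{i}_l,j} \overset{iid}{\sim} \mathrm{Beta}(1, \beta), \qquad \phi^{(d)}_{\mathbf{i}_l,j} \overset{iid}{\sim} G_{\mathbf{i}_l}. \tag{9}$$

This representation retains full independence among random variables, and will lead to a simple stochastic
variational inference algorithm. We note that the atoms from the top-level DP are randomly permuted
and copied with this construction; $\phi^{(d)}_{\mathbf{i}_l,j}$ does not correspond to the node with parameter $\theta_{(\mathbf{i}_l,j)}$. To find
the probability mass $G^{(d)}_{\mathbf{i}_l}$ places on $\theta_{(\mathbf{i}_l,j)}$, one can sum the weights of all second-level breaks whose atom
equals $\theta_{(\mathbf{i}_l,j)}$,

$$G^{(d)}_{\mathbf{i}_l}(\{\theta_{(\mathbf{i}_l,j)}\}) = \sum_{m=1}^{\infty} V^{(d)}_{\mathbf{i}_l,m} \prod_{k=1}^{m-1} (1 - V^{(d)}_{\mathbf{i}_l,k}) \, \mathbb{I}\big(\phi^{(d)}_{\mathbf{i}_l,m} = \theta_{(\mathbf{i}_l,j)}\big).$$

Using a nesting of HDPs to construct $T_d$, each document has a tree with transition probabilities defined
over the same subset of nodes since $T$ is discrete, but with values for these probabilities that are document
specific. To see how this permits each word to follow its own path while still retaining thematic coherence
within a document, consider each $G^{(d)}_{\mathbf{i}_l}$ when $\beta$ is small. In this case, most of the probability will be
placed on one atom selected from $G_{\mathbf{i}_l}$ since the first proportion $V^{(d)}_{\mathbf{i}_l,1}$ will be large with high probability.
This will leave little probability remaining for other atoms, a feature of the prior on all second-level DPs
in $T_d$. Starting from the root node of $T_d$, each word will be highly "encouraged" to select one particular
atom at any given node, with some probability of diverging into a random effect topic. In the limit $\beta \to 0$,
each $G^{(d)}_{\mathbf{i}_l}$ will be a delta function on a $\phi^{(d)}_{\mathbf{i}_l} \sim G_{\mathbf{i}_l}$, and the same path will be selected by each word
with probability one, thus recovering the nCRP.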

B. Generating a document 

With the tree $T_d$ for document $d$ we have a method for selecting word-specific paths that are thematically
coherent. We next discuss generating a document with this tree. As discussed in Section II-B2, with the
nCRP the atoms selected for a document by its path through $T$ have a unique stick-breaking distribution
determining which level any particular word comes from. We generalize this idea to the tree $T_d$ with an
overlapping stick-breaking construction as follows.
For each node $\mathbf{i}_l$, we draw a document-specific beta random variable that acts as a stochastic switch;
given a word is at node $\mathbf{i}_l$, it determines the probability that the word uses the topic at that node or
continues on down the tree. That is, given the path for word $W_{d,n}$ is at node $\mathbf{i}_l$, stop with probability

$$U_{d,\mathbf{i}_l} \sim \mathrm{Beta}(\gamma_1, \gamma_2), \tag{10}$$

or continue by selecting node $\mathbf{i}_{l+1}$ according to $G^{(d)}_{\mathbf{i}_l}$. We observe the stick-breaking construction implicit
in this construction; for word $n$ in document $d$, the probability that its topic $\varphi_{d,n} = \theta_{\mathbf{i}_l}$ is

$$\Pr(\varphi_{d,n} = \theta_{\mathbf{i}_l} \mid T_d, U_d) = \left[ \prod_{m=0}^{l-1} G^{(d)}_{\mathbf{i}_m}(\{\theta_{\mathbf{i}_{m+1}}\}) \right] \left[ U_{d,\mathbf{i}_l} \prod_{m=1}^{l-1} (1 - U_{d,\mathbf{i}_m}) \right]. \tag{11}$$

We use $\mathbf{i}_m \subset \mathbf{i}_l$ to indicate that the first $m$ values in $\mathbf{i}_l$ are equal to $\mathbf{i}_m$; the products in (11) run over these
prefixes of $\mathbf{i}_l$. The leftmost term in this expression is the probability of path $\mathbf{i}_l$, the right term is the
probability that the word does not select the first $l - 1$ topics, but does select the $l$th. Since all random
variables are independent, a simple product form results that will significantly aid the development of a
posterior inference algorithm. The overlapping nature of this stick-breaking construction on the levels of
a sequence is evident from the fact that the random variables $U$ are shared for the first $l$ values by all
paths along the subtree starting at node $\mathbf{i}_l$. A similar tree-structured prior distribution was presented by
Adams, et al. [18] in which all groups shared the same distribution on a tree and entire objects (e.g.
images or documents) were clustered within a single node. We summarize our model for generating
documents with the nHDP in Algorithm 1.

Algorithm 1 Generating Documents with the Nested Hierarchical Dirichlet Process

Step 1. Generate a global tree $T$ by constructing an nCRP as in Section II-B1.

Step 2. Generate document tree $T_d$ and switching probabilities $U_d$. For document $d$,
   a) For each DP in $T$, draw a second-level DP with this base distribution (Equation 8).
   b) For each node in $T_d$ (equivalently $T$), draw a beta random variable (Equation 10).

Step 3. Generate the documents. For word $n$ in document $d$,
   a) Sample atom $\varphi_{d,n} = \theta_{\mathbf{i}_l}$ with probability given in Equation (11).
   b) Sample $W_{d,n}$ from the discrete distribution with parameter $\varphi_{d,n}$.
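The generative process can be sketched on a finite tree. The following is a minimal simulation under assumptions of our own, not the implementation used in our experiments: the tree is truncated to `K` children per node and `DEPTH` levels, the leftover stick mass is assigned to the last break, a word is forced to stop at the deepest level, and the base distribution is a symmetric Dirichlet with hypothetical parameter `lam`. All function names are illustrative.

```python
import random

K, DEPTH, VOCAB = 3, 3, 8   # hypothetical truncation and vocabulary size

def stick_weights(conc, truncation, rng):
    """Truncated stick-breaking weights; the last break takes the leftover mass."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, conc)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights

def dirichlet(conc, dim, rng):
    """Symmetric Dirichlet draw via normalized gamma variables."""
    g = [rng.gammavariate(conc, 1.0) for _ in range(dim)]
    total = sum(g)
    return [x / total for x in g]

def all_nodes(depth, k):
    """All paths (i_1, ..., i_l) for l = 0..depth; the empty tuple is the root."""
    level, nodes = [()], [()]
    for _ in range(depth):
        level = [p + (j,) for p in level for j in range(k)]
        nodes += level
    return nodes

def global_tree(alpha, lam, rng):
    """Step 1: child weights for every internal node, a topic for every non-root node."""
    nodes = all_nodes(DEPTH, K)
    child_w = {p: stick_weights(alpha, K, rng) for p in nodes if len(p) < DEPTH}
    topics = {p: dirichlet(lam, VOCAB, rng) for p in nodes if p}
    return child_w, topics

def document_tree(child_w, beta, gamma1, gamma2, rng):
    """Step 2: second-level DPs (Equation 8) and switches (Equation 10)."""
    doc_w = {}
    for p, w in child_w.items():
        sticks = stick_weights(beta, 2 * K, rng)
        pointers = rng.choices(range(K), weights=w, k=len(sticks))
        mass = [0.0] * K
        for s, j in zip(sticks, pointers):
            mass[j] += s                  # pool breaks that share an atom
        doc_w[p] = mass
    switch = {p + (j,): rng.betavariate(gamma1, gamma2)
              for p in child_w for j in range(K)}
    return doc_w, switch

def generate_word(doc_w, switch, topics, rng):
    """Step 3: walk down the document tree, stop at a node, emit a word index."""
    path = ()
    while True:
        j = rng.choices(range(K), weights=doc_w[path])[0]
        path += (j,)
        if len(path) == DEPTH or rng.random() < switch[path]:
            word = rng.choices(range(VOCAB), weights=topics[path])[0]
            return path, word

if __name__ == "__main__":
    rng = random.Random(0)
    child_w, topics = global_tree(alpha=1.0, lam=0.5, rng=rng)
    doc_w, switch = document_tree(child_w, beta=0.5, gamma1=1.0, gamma2=1.0, rng=rng)
    print([generate_word(doc_w, switch, topics, rng) for _ in range(5)])
```

With a small `beta`, most words in a simulated document tend to follow one dominant path, with occasional departures into other branches.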

IV. Stochastic variational inference for the Nested HDP 

Many text corpora can be viewed as "Big Data" — they are large data sets for which standard inference 
algorithms can be prohibitively slow. For example, Wikipedia currently indexes several million entries, and 
The New York Times has published almost two million articles in the last 20 years. With so much data, fast 
inference algorithms are essential. Stochastic variational inference is a development in this direction for
hierarchical Bayesian models in which ideas from stochastic optimization are applied to approximate
Bayesian inference using mean-field variational Bayes [19][7]. Stochastic inference algorithms have
provided significant speed-ups in inference for probabilistic topic models [8][9][20]. In this section, after
reviewing the ideas behind stochastic variational inference, we present a stochastic variational inference 
algorithm for the nHDP topic model. 

A. Stochastic variational inference 

Stochastic variational inference exploits the difference between local variables, or those associated 
with a single unit of data, and global variables, which are shared among an entire data set. In brief, 
stochastic VB works by splitting a large data set into smaller groups, processing the local variables 
of one group, updating the global variables, and then moving to another group. This is in contrast to 
batch inference, which processes all local variables at once before updating the global variables. In the 
context of probabilistic topic models, the unit of data is a document, and the global variables include the 
topics (among other variables), while the local variables relate to the distribution on these topics for each 
document. We next briefly review the relevant ideas from variational inference and its stochastic variant. 

1) The batch set-up: Mean-field variational inference is a method for approximate posterior inference
in Bayesian models [21]. It approximates the full posterior of a set of model parameters $P(\Phi \mid W)$ with
a factorized distribution $Q(\Phi \mid \Psi) = \prod_i q_i(\phi_i \mid \psi_i)$. It does this by searching the space of variational
approximations for one that is close to the posterior according to their Kullback-Leibler divergence.
Algorithmically, this is done by maximizing the variational objective $\mathcal{L}$ with respect to the variational
parameters $\Psi$ of $Q$, where

$$\mathcal{L}(W, \Psi) = \mathbb{E}_Q[\ln P(W, \Phi)] - \mathbb{E}_Q[\ln Q]. \tag{12}$$

We are interested in conjugate exponential models, where the prior and likelihood of all nodes of
the model fall within the conjugate exponential family. In this case, variational inference has a simple
optimization procedure [22], which we illustrate with the following example; this generic example gives
the general form exploited by the stochastic variational inference algorithm that we apply to the nHDP.

Consider $D$ independent samples from an exponential family distribution $P(W \mid \eta)$, where $\eta$ is the
natural parameter vector. The likelihood under this model has the standard form

$$P(W_1, \dots, W_D \mid \eta) = \left[ \prod_{d=1}^{D} h(w_d) \right] \exp\left\{ \eta^T \sum_{d=1}^{D} t(w_d) - D A(\eta) \right\}.$$

The sum of vectors $t(w_d)$ forms the sufficient statistics of the likelihood. The conjugate prior on $\eta$ has
a similar form

$$P(\eta \mid \chi, \nu) = f(\chi, \nu) \exp\left\{ \eta^T \chi - \nu A(\eta) \right\}.$$

Conjugacy between these two distributions motivates selecting a $q$ distribution in this same family to
approximate the posterior of $\eta$,

$$q(\eta \mid \chi', \nu') = f(\chi', \nu') \exp\left\{ \eta^T \chi' - \nu' A(\eta) \right\}.$$

The variational parameters $\chi'$ and $\nu'$ are free and are modified to maximize the lower bound in Equation
(12).⁴ Inference proceeds by taking the gradient of $\mathcal{L}$ with respect to the variational parameters of a
particular $q$, in this case the vector $\psi := [\chi'^T, \nu']^T$, and setting to zero to find their updated values. For
the conjugate exponential example we are considering, this gradient is

$$\nabla_\psi \mathcal{L}(W, \Psi) = - \begin{bmatrix} \dfrac{\partial^2 \ln f(\chi', \nu')}{\partial \chi' \partial \chi'^T} & \dfrac{\partial^2 \ln f(\chi', \nu')}{\partial \chi' \partial \nu'} \\[2ex] \dfrac{\partial^2 \ln f(\chi', \nu')}{\partial \nu' \partial \chi'^T} & \dfrac{\partial^2 \ln f(\chi', \nu')}{\partial \nu'^2} \end{bmatrix} \begin{bmatrix} \chi + \sum_{d=1}^{D} t(w_d) - \chi' \\[1ex] \nu + D - \nu' \end{bmatrix}. \tag{13}$$

Setting this to zero, one can immediately read off the variational parameter updates from the rightmost
vector. In this case they are $\chi' = \chi + \sum_{d=1}^{D} t(w_d)$ and $\nu' = \nu + D$, which involve the sufficient statistics
for the $q$ distribution calculated from the data.
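As a concrete instance of these updates, consider a Poisson likelihood written in natural form, with $\eta$ the log rate, $t(w) = w$ and $A(\eta) = e^{\eta}$; the conjugate prior then corresponds to a gamma distribution on the rate with shape $\chi$ and rate $\nu$. This choice of likelihood and the function name `conjugate_update` are assumptions made for the example only.

```python
def conjugate_update(chi, nu, data, t=lambda w: w):
    """Batch update chi' = chi + sum_d t(w_d), nu' = nu + D.

    `t` maps an observation to its sufficient statistic; the default is the
    identity, which is the Poisson case assumed in this example.
    """
    chi_post = chi + sum(t(w) for w in data)
    nu_post = nu + len(data)
    return chi_post, nu_post

if __name__ == "__main__":
    counts = [3, 0, 4, 1]                     # toy observations
    chi_post, nu_post = conjugate_update(2.0, 1.0, counts)
    # With a gamma(shape=chi, rate=nu) prior on the Poisson rate,
    # the posterior mean of the rate is chi' / nu'.
    print(chi_post, nu_post, chi_post / nu_post)
```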

2) A stochastic extension: Stochastic optimization of the variational lower bound modifies batch
inference by forming a noisy gradient of $\mathcal{L}$ at each iteration. The variational parameters for a random
subset of the data are optimized first, followed by a step in the direction of the noisy gradient of the
global variational parameters. Let $C_s \subset \{1, \dots, D\}$ index a subset of the data at step $s$. Also let $\phi_d$
be the hidden local variables associated with observation $w_d$ and let $\Phi_W$ be the global variables shared
among all observations. The stochastic variational objective function $\mathcal{L}_s$ is the noisy version of $\mathcal{L}$ formed
by selecting a subset of the data,

$$\mathcal{L}_s(W_{C_s}, \Psi) = \frac{D}{|C_s|} \sum_{d \in C_s} \mathbb{E}_Q[\ln P(w_d, \phi_d \mid \Phi_W)] + \mathbb{E}_Q[\ln P(\Phi_W) - \ln Q]. \tag{14}$$

This takes advantage of the conditional independence among the data, and so the log of the joint likelihood
can be written as a sum over the $D$ observations. Optimizing $\mathcal{L}_s$ optimizes $\mathcal{L}$ in expectation; since each
subset $C_s$ is equally probable, with $p(C_s) = \binom{D}{|C_s|}^{-1}$, and since $d \in C_s$ for $\binom{D-1}{|C_s|-1}$ of the $\binom{D}{|C_s|}$ possible
subsets, it follows that $\mathbb{E}_{p(C_s)}[\mathcal{L}_s(W_{C_s}, \Psi)] = \mathcal{L}(W, \Psi)$.
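The counting argument behind this expectation can be checked exactly on a toy problem by enumerating every subset of a fixed size. In the sketch below, `values` stands in for the per-observation terms of the objective; the function name `expected_subsampled_sum` is hypothetical.

```python
from itertools import combinations
from math import comb

def expected_subsampled_sum(values, batch_size):
    """Exact expectation of (D / |C_s|) * sum_{d in C_s} values[d]
    when C_s is a uniformly random subset of the given size."""
    D = len(values)
    scale = D / batch_size
    total = 0.0
    for subset in combinations(range(D), batch_size):
        total += scale * sum(values[d] for d in subset)
    return total / comb(D, batch_size)   # each subset has probability 1 / C(D, |C_s|)

if __name__ == "__main__":
    values = [0.5, -1.0, 2.0, 3.5, 0.25]
    for b in range(1, len(values) + 1):
        print(b, expected_subsampled_sum(values, b), sum(values))
```

For every batch size the rescaled subset sum has the same expectation as the full sum, which is the unbiasedness property used above.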



⁴ A closed form expression for the lower bound is readily derived for this example.






Stochastic variational inference proceeds by optimizing the objective in (14) with respect to $\psi_d$ for
$d \in C_s$, followed by an update to the global variational parameters that blends the new information with
the old. For example, in our conjugate exponential example the update of the global variational parameter
$\psi := [\chi'^T, \nu']^T$ at step $s$ is $\psi_s = \psi_{s-1} + \rho_s B \nabla_\psi \mathcal{L}_s(W_{C_s}, \Psi)$, where the matrix $B$ is a positive
definite preconditioning matrix and $\rho_s$ is a step size satisfying $\sum_{s=1}^{\infty} \rho_s = \infty$ and $\sum_{s=1}^{\infty} \rho_s^2 < \infty$, which
ensures convergence [19].



The gradient $\nabla_\psi \mathcal{L}_s(W_{C_s}, \Psi)$ has a similar form as Equation (13), with the exception that the sum is
taken over a subset of the data. Though the matrix in (13) is often very complicated, it is superfluous
to batch variational inference for conjugate exponential family models. In the stochastic optimization
of Equation (12), however, this matrix cannot be similarly ignored. The key to stochastic variational
inference for conjugate exponential models is in selecting $B$. Since the gradient of $\mathcal{L}_s$ has the same
form as Equation (13), $B$ can be set to the inverse of the matrix in (13) to allow for cancellation. An
interesting observation is that this matrix is

$$B = -\left( \frac{\partial^2 \ln q(\eta | \psi)}{\partial \psi \, \partial \psi^T} \right)^{-1}, \tag{15}$$

which is the inverse Fisher information of the variational distribution $q(\eta | \psi)$. Using this $B$, the step
direction is the natural gradient of the lower bound, and therefore not only simplifies the algorithm, but
also gives an efficient step direction [23]. The resulting variational update is a $\rho_s$-weighted combination
of the old sufficient statistics for $q$ with the new ones calculated over data indexed by $C_s$.
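The $\rho_s$-weighted combination described above amounts to a convex blend of two vectors. A minimal sketch follows; the function name and the list representation of the statistics are hypothetical and chosen only for illustration.

```python
def natural_gradient_step(old_stats, batch_stats, rho):
    """Blend old sufficient statistics with those computed from the current subset.

    Equivalent to old + rho * (batch - old), i.e. a step of size rho along the
    direction pointing from the old value to the new estimate.
    """
    return [(1.0 - rho) * o + rho * b for o, b in zip(old_stats, batch_stats)]


blended = natural_gradient_step([1.0, 2.0], [3.0, 6.0], 0.25)
```

With `rho = 1` the old statistics are discarded entirely, and with `rho = 0` the new sub-batch is ignored; intermediate values interpolate between the two.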



B. The inference algorithm 

We develop a stochastic variational inference algorithm for approximate posterior inference of the 
nHDP topic model. As discussed in our general review of stochastic inference, this entails optimizing 
the local variational parameters for a subset of documents, followed by a step along the natural gradient 
of the global variational parameters. We distinguish between local and global variables for the nHDP in
Table I. In Table I we also give the variational $q$ distributions selected for each variable. In almost all
cases, we select this distribution to be in the same family as the prior. We point out two additional latent
indicator variables that we have added for inference: $c_{d,n}$, which indicates the topic from which $w_{d,n}$ is
drawn, and $z^{(d)}_{i,j}$, which points to the atom in $G_i$ for the $j$th break in $G^{(d)}_i$ using the stick-breaking
construction of the second-level DP.

In addition to local and global variational parameter updates, we introduce a third aspect to our inference 
algorithm. Before optimizing any variational parameters, we select a subtree from T for each document 
using a greedy algorithm. This greedy algorithm is performed with respect to the variational objective
function, and so we are still performing variational inference. This limits the number of paths for which
variational parameters must be learned for a given document, which further speeds up inference. We
discuss this greedy algorithm below, followed by the variational parameter updates for the local and
global q distributions.

TABLE I

A LIST OF THE LOCAL AND GLOBAL VARIABLES AND THEIR RESPECTIVE q DISTRIBUTIONS FOR THE NHDP TOPIC MODEL.

| Variable | Description | $q$ distribution |
|---|---|---|
| Global variables: | | |
| $\theta_i$ | topic probability vector for node $i$ | $q(\theta_i) = \mathrm{Dirichlet}(\theta_i \mid \lambda_{i,1}, \dots, \lambda_{i,V})$ |
| $V_{i,j}$ | stick proportion for the top-level DP for node $i$ | $q(V_{i,j}) = \mathrm{Beta}(V_{i,j} \mid \tau^{(1)}_{i,j}, \tau^{(2)}_{i,j})$ |
| Local variables: | | |
| $V^{(d)}_{i,j}$ | stick proportion for second-level DP for node $i$ | $q(V^{(d)}_{i,j}) = \mathrm{Beta}(V^{(d)}_{i,j} \mid u^{(d)}_{i,j}, v^{(d)}_{i,j})$ |
| $z^{(d)}_{i,j}$ | index pointer to atom in $G_i$ for $j$th break in $G^{(d)}_i$ | $q(z^{(d)}_{i,j}) = \delta_{z^{(d)}_{i,j}}(k)$, $k = 1, 2, \dots$ |
| $U_{d,i}$ | beta distributed switch probability for node $i$ | $q(U_{d,i}) = \mathrm{Beta}(U_{d,i} \mid a_{d,i}, b_{d,i})$ |
| $c_{d,n}$ | topic indicator for word $n$ in document $d$ | $q(c_{d,n}) = \mathrm{Discrete}(c_{d,n} \mid \nu_{d,n})$ |

1) Greedy subtree selection: As mentioned, we perform a greedy algorithm with respect to the 
variational objective function to determine a subtree from T for each document. We first describe the 
algorithm followed by a mathematical representation. Starting from the root node, we sequentially add 
nodes from T from those currently "activated." An activated node is one whose parent is contained 
within the subtree but which is not itself in the subtree. We hold the q distributions for the document- 
specific beta distributions fixed to their priors and set the variational distribution for each word's topic 
indicator to zero on all unactivated nodes. We then ask: Which of the activated nodes not currently in 
the subtree will lead to the greatest increase in the variational objective? This only involves optimizing 
the variational parameter for each word over the current subtree plus the candidate node, which does not 
require iterating. We continue adding the maximizing node until the marginal increase in the objective 
falls below a threshold. We formalize this process below. 

a) Coordinate update for $q(z^{(d)}_{i,j})$: As defined in Table I, $z^{(d)}_{i,j}$ is the variable that indicates the index
of the atom from the top-level DP $G_i$ pointed to by the $j$th stick-breaking weight in $G^{(d)}_i$. We select a
delta $q$ distribution for this variable, meaning we make a hard assignment for this value. Starting with an
empty tree, all atoms in $G_{i_0}$ constitute the activated set. Adding the first node is equivalent to determining
the value for $z^{(d)}_{i_0,1}$; in general, creating a subtree for $\mathcal{T}_d$, denoted $\mathcal{T}'_d$, is equivalent to determining which
$z^{(d)}_{i,j}$ to include in $\mathcal{T}'_d$ and the atoms to which they point.

For a subtree of size $t$ corresponding to document $d$, let the set $\mathcal{I}_{d,t}$ contain the index values of
the included nodes, let $\mathcal{S}_{d,t} = \{i : pa(i) \in \mathcal{I}_{d,t}, i \notin \mathcal{I}_{d,t}\}$. Also, let $C_{d,t,i'}$ denote the conditions that
$\nu_{d,n}(i) = 0$ for all $i \notin \mathcal{I}_{d,t} \cup i'$ and that $q(\cdot)$ is set fixed to the prior for all other document specific
distributions. Then provided the marginal increase in the variational objective is above a preset threshold,
we increment the subtree by $\mathcal{I}_{d,t+1} \leftarrow \mathcal{I}_{d,t} \cup i^*$, where

$$i^* = \arg\max_{i' \in \mathcal{S}_{d,t}} \sum_{n=1}^{N_d} \max_{\nu_{d,n} : C_{d,t,i'}} E_q[\ln p(w_{d,n} | c_{d,n}, \theta)] + E_q[\ln p(c_{d,n}, z^{(d)} | V, V_d, U_d)] - E_q[\ln q(c_{d,n})]. \tag{16}$$

The optimal values for $\nu_{d,n}$ are given below in Equation (17). We note two aspects of this greedy
algorithm. First, though the stick-breaking construction of the second-level DP allows for
atoms to repeat, in this algorithm each added atom is new, since there is no advantage in duplicating
atoms. Therefore, the algorithm approximates each $G^{(d)}_i$ by selecting and reordering a subset of atoms
from $G_i$ for its stick-breaking construction. (The subtree $\mathcal{T}'_d$ may contain no atoms or one atom from a
$G_i$.) The second aspect we point out is the changing prior on the same node in $\mathcal{T}$. If the atom $\theta_{(i,m)}$ is
a candidate for addition, then it remains a candidate until it is either selected by a $z^{(d)}_{i,j}$, or the algorithm
terminates. The prior on selecting this atom changes, however, depending on whether it is a candidate for
$z^{(d)}_{i,j}$ or $z^{(d)}_{i,j'}$. Therefore, incorporating a sibling of $\theta_{(i,m)}$ impacts the prior on incorporating $\theta_{(i,m)}$. This
penalty corresponds to the prior on word indicators, and is in addition to the penalty of the atom itself
from the top-level DP.
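The control flow of the greedy selection can be sketched independently of the model. The code below is a simplified illustration under stated assumptions, not the paper's implementation: nodes are represented as tuples of child indices, the callable `objective` is a hypothetical stand-in for the document's variational objective in (16), and the toy objective used in the example is simply additive.

```python
def greedy_subtree(children, objective, threshold, root=()):
    """Grow a subtree by repeatedly adding the activated node with the largest gain.

    children:  dict mapping a node (tuple of child indices) to its child nodes
    objective: callable taking a frozenset of included nodes and returning a score
               (a stand-in for the per-document variational objective)
    threshold: stop when the best marginal increase falls below this value
    """
    subtree = set()
    activated = set(children.get(root, []))  # empty tree: the root's children are candidates
    current = objective(frozenset(subtree))
    while activated:
        gains = {c: objective(frozenset(subtree | {c})) - current for c in activated}
        best = max(gains, key=gains.get)
        if gains[best] < threshold:
            break
        subtree.add(best)
        current += gains[best]
        activated.remove(best)
        activated.update(children.get(best, []))  # children of the new node become activated
    return subtree


# Toy tree and additive toy objective (assumed values, for illustration only).
children = {(): [(0,), (1,)], (0,): [(0, 0), (0, 1)], (1,): [(1, 0)]}
values = {(0,): 5.0, (1,): 0.5, (0, 0): 2.0, (0, 1): 0.001, (1, 0): 9.0}
selected = greedy_subtree(children, lambda nodes: sum(values[n] for n in nodes), 1.0)
```

In this toy run the node `(1, 0)` is never considered even though its value is large, because its parent `(1,)` does not pass the threshold. This is the usual price of a greedy search: only nodes whose parent is already in the subtree are ever evaluated.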

2) Coordinate updates for document variables: Given the subtree $\mathcal{T}'_d$ selected for document $d$, we
optimize the variational parameters for the $q$ distributions on $c_{d,n}$, $V^{(d)}_{i,j}$ and $U_{d,i}$ over that subtree.
a) Coordinate update for $q(c_{d,n})$: The variational distribution on the path for word $w_{d,n}$ is

$$\nu_{d,n}(i) \propto \exp\left\{ E_q[\ln \theta_{i, w_{d,n}}] + E_q[\ln \pi_{d,i}] \right\}, \tag{17}$$

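where the prior term $\pi_{d,i}$ is the tree-structured prior of the nHDP,

$$\pi_{d,i} = \left[ \prod_{(i',j') \subseteq i} \prod_{j} \left( V^{(d)}_{i',j} \prod_{m<j} (1 - V^{(d)}_{i',m}) \right)^{\mathbb{I}(z^{(d)}_{i',j} = j')} \right] \left[ U_{d,i} \prod_{i' \subset i} (1 - U_{d,i'}) \right]. \tag{18}$$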

The expectation $E_q[\ln \theta_{i,w}] = \psi(\lambda_{i,w}) - \psi(\sum_{w'} \lambda_{i,w'})$, where $\psi(\cdot)$ is the digamma function. Similarly, for
a random variable $Y \sim \mathrm{Beta}(a, b)$, $E[\ln Y] = \psi(a) - \psi(a + b)$ and $E[\ln(1 - Y)] = \psi(b) - \psi(a + b)$. The
corresponding values of $a$ and $b$ for $U$ and $V$ are given in their respective updates below.
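These expectations are straightforward to compute. The sketch below is illustrative only: it implements the digamma function with the standard recurrence and asymptotic series (the Python standard library does not provide one), and the helper names are hypothetical. The last function shows one numerically stable way to normalize the log-weights that appear in the exponent of (17).

```python
import math


def digamma(x):
    """Digamma function for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:  # shift upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def expected_log_beta(a, b):
    """Return (E[ln Y], E[ln(1 - Y)]) for Y ~ Beta(a, b)."""
    total = digamma(a + b)
    return digamma(a) - total, digamma(b) - total


def expected_log_dirichlet(lam):
    """Return E[ln theta_w] for theta ~ Dirichlet(lam), one value per word."""
    total = digamma(sum(lam))
    return [digamma(l) - total for l in lam]


def normalize_log_weights(log_weights):
    """Turn unnormalized log probabilities into a distribution (log-sum-exp trick)."""
    m = max(log_weights)
    exps = [math.exp(lw - m) for lw in log_weights]
    z = sum(exps)
    return [e / z for e in exps]
```

As a sanity check, a uniform random variable is Beta(1, 1), and its expected log is $-1$, which `expected_log_beta(1.0, 1.0)` reproduces to within the accuracy of the series.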

We note that, given the subtree $\mathcal{T}'_d$, the distribution on paths has a familiar feel as LDA, but where
LDA uses a flat Dirichlet prior on $\pi_d$, the nHDP uses a prior that is the product of several beta random
variables having a tree-structured form. Though the form is more complicated, the independence results
in simple closed-form updates for these beta variables that only depend on $\nu_{d,n}$.

b) Coordinate update for $q(V^{(d)}_{i,j})$: The variational parameter updates for the second-level
stick-breaking proportions are

$$u^{(d)}_{i,j} = 1 + \sum_{i' : (i,j) \subseteq i'} \sum_{n=1}^{N_d} \nu_{d,n}(i'), \tag{19}$$

$$v^{(d)}_{i,j} = \beta + \sum_{i' : i \subset i'} \mathbb{I}\{ i'(l+1) > j \} \sum_{n=1}^{N_d} \nu_{d,n}(i'), \tag{20}$$

where $l$ is the level of node $i$ and $i'(l+1)$ is the index that the path to $i'$ takes at level $l+1$.
The statistic for the first parameter is the expected number of words in document $d$ that pass through or
stop at node $(i,j)$. The statistic for the second parameter is the expected number of words from document
$d$ whose paths pass through the same parent $i$, but then transition to a node with index greater than $j$
according to the indicators $z^{(d)}_{i,m}$ from the second-level stick-breaking construction of $G^{(d)}_i$.

c) Coordinate update for $q(U_{d,i})$: The variational parameter updates for the switching probabilities
are similar to those of the second-level stick-breaking process, but collect the statistics from $\nu_{d,n}$ in a
slightly different way,

$$a_{d,i} = \gamma_1 + \sum_{n=1}^{N_d} \nu_{d,n}(i), \tag{21}$$

$$b_{d,i} = \gamma_2 + \sum_{i' : i \subset i'} \sum_{n=1}^{N_d} \nu_{d,n}(i'). \tag{22}$$

The statistic for the first parameter is the expected number of words that use the topic at node $i$. The
statistic for the second parameter is the expected number of words that pass through node $i$ but do not
terminate there.
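The two statistics for the switching probabilities can be collected with a prefix test on node indices. This is a minimal sketch under assumptions made for illustration: nodes are tuples of child indices, the allocation probabilities of a document are stored in a dictionary keyed by node, and the function names are hypothetical.

```python
def is_strict_descendant(node, ancestor):
    """True if `ancestor` is a proper prefix of `node` (both are index tuples)."""
    return len(node) > len(ancestor) and node[:len(ancestor)] == ancestor


def switch_beta_parameters(nu, node, gamma1, gamma2):
    """Return (a, b) for the beta q distribution of the switch at `node`.

    nu maps each subtree node to a list with one allocation probability per word.
    """
    stops_here = sum(nu.get(node, []))
    passes_through = sum(
        sum(probs) for other, probs in nu.items() if is_strict_descendant(other, node)
    )
    return gamma1 + stops_here, gamma2 + passes_through


# Two words allocated over a four-node subtree; each word's probabilities sum to one.
nu = {(0,): [0.5, 0.2], (0, 1): [0.25, 0.5], (0, 1, 2): [0.25, 0.25], (3,): [0.0, 0.05]}
a, b = switch_beta_parameters(nu, (0,), 2.0 / 3.0, 4.0 / 3.0)
```

For node `(0,)` the first statistic is 0.7 expected words stopping there and the second is 1.25 expected words continuing to its descendants; the node `(3,)` contributes to neither.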

3) Stochastic updates for corpus variables: After selecting the subtrees and updating the local
document-specific variational parameters for each document $d$ in sub-batch $s$, we take a step in the direction of the
natural gradient of the parameters of the $q$ distributions on the global variables. These include the topics
$\theta_i$ and the top-level stick-breaking proportions $V_{i,j}$.

a) Stochastic update for $q(\theta_i)$: For the stochastic update of the Dirichlet $q$ distributions on each
topic $\theta_i$, first form the vector $\lambda'_i$ of sufficient statistics using the data in sub-batch $s$,

$$\lambda'_{i,w} = \frac{D}{|C_s|} \sum_{d \in C_s} \sum_{n=1}^{N_d} \nu_{d,n}(i) \, \mathbb{I}\{w_{d,n} = w\}, \quad w = 1, \dots, V.$$

This vector contains the expected count of the number of words with index $w$ that originate from topic
$\theta_i$ over documents indexed by $C_s$. According to the stochastic inference theory in Section IV-A2, this
number is scaled up to a corpus of size $D$. The update to the variational parameters for the associated $q$
distribution is

$$\lambda^{s+1}_{i,w} = \lambda_0 + (1 - \rho_s) \lambda^{s}_{i,w} + \rho_s \lambda'_{i,w}. \tag{23}$$

We see a blending of the old with the new in this update. Since $\rho_s \to 0$ as $s$ increases, the algorithm uses
less and less information from new sub-groups of documents, which reflects the increasing confidence
in this parameter value as more data is seen.
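The scaled sufficient statistics for one topic can be accumulated in a single pass over the sub-batch. The sketch below covers only this accumulation step, under assumed data structures: each document is a pair of word indices and a dictionary of per-node allocation probabilities, nodes are tuples, and the function name is hypothetical.

```python
def scaled_topic_counts(batch, node, vocab_size, corpus_size):
    """Expected word counts for the topic at `node`, scaled from the sub-batch to the corpus.

    batch: list of documents; each document is a (words, nu) pair where
           words[n] is the vocabulary index of word n and nu[node][n] is the
           probability that word n is allocated to `node`.
    """
    counts = [0.0] * vocab_size
    for words, nu in batch:
        probs = nu.get(node)
        if probs is None:  # node is not in this document's subtree
            continue
        for w, p in zip(words, probs):
            counts[w] += p
    scale = corpus_size / len(batch)
    return [scale * c for c in counts]


# Two documents from a hypothetical corpus of ten; only the first uses node (0,).
batch = [([0, 2, 2], {(0,): [1.0, 0.5, 0.25]}), ([1], {(1,): [1.0]})]
lam_prime = scaled_topic_counts(batch, (0,), vocab_size=3, corpus_size=10)
```

The factor `corpus_size / len(batch)` plays the role of $D/|C_s|$, so the result is what the expected counts would be if the whole corpus looked like the sub-batch.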

b) Stochastic update for $q(V_{i,j})$: Similarly to $\theta_i$, we first collect the sufficient statistics for the $q$
distribution on $V_{i,j}$ from the documents in sub-batch $s$,

$$\tau'_{i,j} = \frac{D}{|C_s|} \sum_{d \in C_s} \mathbb{I}\{(i,j) \in \mathcal{I}_d\}, \qquad \tau''_{i,j} = \frac{D}{|C_s|} \sum_{d \in C_s} \sum_{j' > j} \mathbb{I}\{(i,j') \in \mathcal{I}_d\}.$$

The first value scales up the number of documents in sub-batch $s$ that include atom $\theta_{(i,j)}$ in their subtree;
the second value scales up the number of times an atom of higher index value in the same DP is used
by a document in sub-batch $s$. The updates to the global variational parameters are

$$\tau^{(1)}_{i,j}(s+1) = 1 + (1 - \rho_s) \tau^{(1)}_{i,j}(s) + \rho_s \tau'_{i,j}, \tag{24}$$

$$\tau^{(2)}_{i,j}(s+1) = \alpha + (1 - \rho_s) \tau^{(2)}_{i,j}(s) + \rho_s \tau''_{i,j}. \tag{25}$$

Again, we see a blending of old information with new.
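The two counts for the top-level stick proportions depend only on which nodes each document's subtree contains. A minimal sketch follows, assuming each subtree is stored as a set of node tuples and that the last entry of a tuple is the child index within its parent's DP; the function name is hypothetical.

```python
def scaled_stick_counts(subtrees, parent, j, corpus_size):
    """Return the two scaled counts for child j of `parent` from a sub-batch of subtrees.

    subtrees: list with one set of node tuples per document in the sub-batch.
    """
    scale = corpus_size / len(subtrees)
    depth = len(parent)
    uses_atom = sum(1 for nodes in subtrees if parent + (j,) in nodes)
    uses_higher = sum(
        1
        for nodes in subtrees
        for node in nodes
        if len(node) == depth + 1 and node[:depth] == parent and node[-1] > j
    )
    return scale * uses_atom, scale * uses_higher


# Four documents from a hypothetical corpus of one hundred.
subtrees = [{(0,), (1,), (0, 2)}, {(1,), (3,)}, {(0,)}, {(2,), (3,)}]
tau1, tau2 = scaled_stick_counts(subtrees, parent=(), j=1, corpus_size=100)
```

Here two of the four documents include node `(1,)`, and siblings with a higher index are used three times in total, so the scaled counts are 50 and 75.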

V. Experiments 

We present an empirical evaluation of the nested HDP topic model in the stochastic and the batch 
inference settings. We first present batch results on three smaller data sets to verify that our multi-path
approach gives an improvement over the single-path nested CRP. We then move to the stochastic
inference setting, where we perform experiments on 1.8 million documents from The New York Times and 
3.3 million documents from Wikipedia. We compare with other recent stochastic inference algorithms for 
topic models: stochastic LDA [8] and the stochastic HDP [9]. Before presenting our results, we discuss 
our method for initializing the topic q distributions of the tree. 

A. Initialization 

As with most Bayesian models, inference for hierarchical topic models can benefit greatly from a good 
initialization. Our goal is to find a method for quickly centering the posterior mean of each topic so that 
they contain some information about their hierarchical relationships. We briefly discuss our approach for 
initializing the global variational topic parameters $\lambda_i$ of the nHDP.

Using a small set of documents from the training set, we form the empirical distribution for each
document on the vocabulary. We then perform k-means clustering of these probability vectors using the
$L_1$ distance measure. At the top level, we partition the data into $n_1$ groups, corresponding to $n_1$ children
of the root node. We then subtract the mean of a group (a probability vector) from all data within that
group, set any negative values to zero and renormalize. We loosely think of this as the "probability of
what remains," a distribution on words not captured by the parent distributions. Within each group we
again perform k-means clustering, obtaining $n_2$ probability vectors for each of the $n_1$ groups, and again
subtracting, setting negative values to zero and renormalizing the remainder of each probability vector
for a document.

Through this hierarchical k-means clustering, we obtain $n_1$ probability vectors at the top level, $n_2$
probability vectors beneath each top-level vector for the second level, $n_3$ probability vectors beneath each
of these second-level vectors, etc. The $n_l$ vectors obtained from any sub-group of data are refinements
of an already coherent sub-group of data, since that sub-group is itself a cluster from a larger group.
Therefore, the resulting tree will have some thematic coherence. The clusters from this algorithm parallel
the nodes within the nHDP tree. For a mean probability vector $\hat{\lambda}_i$ obtained from this algorithm, we set
the corresponding variational parameter for the topic Dirichlet $q$ distribution to $\lambda_i = N(p \hat{\lambda}_i + (1 - p) \mathbf{1}/V)$
for $p \in [0, 1]$ and $N$ a scaling factor. This initializes the mean of $\theta_i$ to be slightly peaked around $\hat{\lambda}_i$,
while the uniform vector and $p$ determine the variance. In our algorithms we set $p = 0.5$ and $N$ equal
to the number of documents.
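Two pieces of this initialization are simple enough to sketch directly: the "probability of what remains" step and the construction of the Dirichlet parameter from a cluster mean. The code below is illustrative only. The function names are hypothetical, the k-means step itself is omitted, and the uniform fallback for a document that coincides with its cluster mean is an assumption that the text does not address.

```python
def remaining_probability(doc_dist, cluster_mean):
    """Subtract a cluster mean from a document distribution, clip at zero, renormalize."""
    residual = [max(p - m, 0.0) for p, m in zip(doc_dist, cluster_mean)]
    total = sum(residual)
    if total == 0.0:  # assumption: fall back to uniform if nothing remains
        return [1.0 / len(doc_dist)] * len(doc_dist)
    return [r / total for r in residual]


def initial_topic_parameter(mean_vector, weight, scale):
    """Mix a cluster mean with the uniform vector and multiply by the scaling factor."""
    V = len(mean_vector)
    return [scale * (weight * m + (1.0 - weight) / V) for m in mean_vector]


residual = remaining_probability([0.5, 0.3, 0.2], [0.2, 0.5, 0.3])
lam = initial_topic_parameter([0.5, 0.25, 0.25], weight=0.5, scale=100)
```

In the first call only the first word has more mass in the document than in the cluster mean, so all remaining probability lands on it. In the second call the entries sum to the scaling factor, and mixing with the uniform vector keeps every entry strictly positive, as a Dirichlet parameter requires.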

B. A batch comparison 

Before comparing our stochastic inference algorithm for the nHDP with similar algorithms for LDA 
and the HDP, we compare a batch version with the nCRP on three smaller data sets. This will verify 
the advantage of giving each document access to the entire tree versus forcing each document to follow 
one path. We compare the variational nHDP topic model with both the variational nCRP [2] and the
Gibbs sampling nCRP [1]. We consider three corpora for our experiments: (i) The Journal of the ACM, a
collection of 536 abstracts from the years 1987-2004 with vocabulary size 1,539; (ii) The Psychological
Review, a collection of 1,272 abstracts from the years 1967-2003 with vocabulary size 1,971; and (iii)
The Proceedings of the National Academy of Science, a collection of 5,000 abstracts from the years
1991-2001 with a vocabulary size of 7,762. The average number of words per document for the three
corpora is 45, 108 and 179, respectively.

Variational inference for Dirichlet process priors uses a truncation of the variational distribution, which limits
the number of topics that are learned [24], [25]. This truncation is set to a number larger than the anticipated
number of topics necessary for modeling the data set, but can adapt if more are needed [26]. We use a
truncated tree of (10,7,5) for modeling these corpora, where 10 children of the root node each have 7
children, which themselves each have 5 children for a total of 420 nodes. Following previous work on
the nCRP, we truncate the tree to three levels. Also, because these three data sets contain stop words, we
follow [2] and [1] by including a root node shared by all documents for this batch problem. Following
[2], we perform five-fold cross validation to evaluate performance on each corpus.

TABLE II

Comparison of the nHDP with the nCRP on three smaller problems.

| Method \ Data set | JACM | Psych. Review | PNAS |
|---|---|---|---|
| Variational nHDP | -5.405 ± 0.012 | -5.674 ± 0.019 | -6.304 ± 0.003 |
| Variational nCRP | -5.433 ± 0.010 | -5.843 ± 0.015 | -6.574 ± 0.005 |
| Gibbs nCRP | -5.392 ± 0.005 | -5.783 ± 0.015 | -6.496 ± 0.007 |

We present our results in Table II. We see that for all data sets, the variational nHDP outperforms the
variational nCRP. For the two larger data sets, the variational nHDP also outperforms Gibbs sampling for
the nCRP. Given the relative sizes of these corpora, we see that the benefit of learning a per-document
distribution on the full tree rather than a path appears to increase as the corpus size and document size
increase. Since we are interested in the "Big Data" regime, this strongly hints at an advantage of our
nHDP approach over the nCRP.



C. Stochastic inference for The New York Times and Wikipedia 

We next present an evaluation of our stochastic variational inference algorithm on The New York Times 
and Wikipedia. These are both very large data sets, with The New York Times containing roughly 1.8 
million articles and Wikipedia roughly 3.3 million web pages. The average document size is somewhat 
larger than those considered in our batch experiments as well, with an article from The New York Times 
containing 254 words on average taken from a vocabulary size of 8,000, and Wikipedia 164 words on 
average taken from a vocabulary size of 7,702. 



1) Setup: We use the algorithm discussed in Section V-A to initialize a three-level tree with (20, 10, 5)
child nodes per level, giving a total of 1,220 initial topics. For the Dirichlet processes, we set all top-level
DP concentration parameters to $\alpha = 5$ and the second-level DP concentration parameters to $\beta = 1$. For
the switching probabilities $U$, we set the beta distribution hyperparameters to $\gamma_1 = 2/3$ and $\gamma_2 = 4/3$,
which takes the weight of a uniform prior and skews it toward smaller values, slightly encouraging a
word to continue down the tree. For our greedy subtree selection algorithm, we stop adding nodes to the
subtree when the marginal improvement to the lower bound falls below $10^{-2}$. When optimizing the local
variational parameters of a document given its subtree, we continue iterating until the absolute change
in the empirical distribution of words on the tree falls below 10 .

We hold out a data set for each corpus for testing; we hold out 14,268 documents for testing The New 
York Times and 8,704 documents for testing Wikipedia. We quantitatively assess the quality of the tree at 
any given point in the algorithm as follows: Holding the top-level variational parameters fixed, for each 
test document we randomly partition the words into a 75/25 percent split. We then learn document-specific 
variational parameters for the 75% portion. Following [27], [2], we use the mean of each $q$ distribution
to form a predictive distribution for the remaining words of that document. With this distribution, we
calculate the average per-word log likelihood of the 25% portion to assess performance. For comparison,
we evaluate stochastic inference algorithms for LDA and the HDP in the same manner. In all algorithms,
we use a sub-batch size of $|C_s| = 5000$ at step $s$ and set the learning rate to $\rho_s = (1 + s)^{-0.75}$. We note
that Hoffman, et al. provide a detailed evaluation of these settings, and while performance depends
on their values, relative performance remains consistent; these settings are in the good performance range 
according to their evaluation and our qualitative results support this conclusion on these data sets. 
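The evaluation protocol can be summarized in a few lines. The sketch below is a simplified illustration rather than the code used for the experiments: the function names are hypothetical, the fitting of document-specific parameters on the 75% portion is omitted, and the predictive distribution is assumed to be a mixture of topic means weighted by per-document node weights.

```python
import math
import random


def split_words(words, train_fraction=0.75, seed=0):
    """Randomly partition a document's words into observed and held-out parts."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def predictive_distribution(node_weights, topics):
    """Mix topic means by per-document node weights: p(w) = sum_i weight_i * topic_i[w]."""
    vocab_size = len(next(iter(topics.values())))
    return [
        sum(node_weights[i] * topics[i][w] for i in node_weights)
        for w in range(vocab_size)
    ]


def per_word_log_likelihood(held_out, predictive):
    """Average log probability of the held-out words under the predictive distribution."""
    return sum(math.log(predictive[w]) for w in held_out) / len(held_out)


# Toy example with two topics over a three-word vocabulary (assumed values).
topics = {(0,): [0.5, 0.5, 0.0], (1,): [0.0, 0.5, 0.5]}
predictive = predictive_distribution({(0,): 0.5, (1,): 0.5}, topics)
observed, held_out = split_words([0, 1, 1, 2, 0, 1, 2, 1])
score = per_word_log_likelihood([0, 1], predictive)
```

Averaging over held-out words rather than summing makes scores comparable across documents of different lengths, which is why the figures report a per-word quantity.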

2) The New York Times: We first present our results for The New York Times. In Figure 2 we show
the log likelihood on the test set as a function of number of documents seen by the model. We see an
improvement in all algorithms as the amount of data seen increases. We also note an improvement in
the performance of the nHDP compared with LDA and the HDP. In Figure 3 we show document-level
statistics from the test set at the final step of the algorithm. These include the sizes of the subtrees, a
breakdown by level of these subtrees, and word allocations by level. We note that while the tree has three
levels, roughly eight topics are being used (in varying degrees) per document. This is in contrast to the
three topics that would be available to any document with the nCRP. Thus there is a clear advantage in
allowing each document to have access to the entire tree.

In Figure 4 we show example topics from the model and their relative structure. We show four topics
from the top level of the tree (shaded), and connect topics according to parent/child relationship. The
model learns a meaningful hierarchical structure; for example, the sports subtree branches into the various
sports, which themselves appear to branch by teams. In the foreign affairs subtree, children tend to group
by major subregion and then branch out into subregion or issue. In Figure 5(a) we give a sense of the size
of the tree as a function of documents seen. Since all topics aren't used equally, we show the number of
nodes containing 90%, 99% and 99.9% of all paths within the subtrees.
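The tree size statistic is the smallest number of nodes whose combined usage reaches a given share of all paths. A minimal sketch, assuming the per-node path counts are available as a list (the counts below are made up for illustration and the function name is hypothetical):

```python
def nodes_needed(path_counts, fraction):
    """Smallest number of nodes whose path counts cover `fraction` of all paths."""
    target = fraction * sum(path_counts)
    covered = 0.0
    for k, count in enumerate(sorted(path_counts, reverse=True), start=1):
        covered += count
        if covered >= target:
            return k
    return len(path_counts)


# Hypothetical usage counts for a seven-node tree.
counts = [500, 300, 120, 50, 25, 3, 2]
sizes = [nodes_needed(counts, f) for f in (0.9, 0.99, 0.999)]
```

Sorting by usage first guarantees that the returned number is the minimum, since any other choice of the same number of nodes covers no more paths.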

Fig. 2. The New York Times: Average per-word log likelihood on a held-out test set as a function of training documents seen.

Fig. 3. The New York Times: Per-document statistics from the test set using the tree at the final step of the algorithm. (a)
A histogram of the size of the subtree selected for a document. (b) The average number of nodes by level within the subtree
(white), and the average number by level that have at least one expected observation (black). (c) The average number of words
allocated to each level of the tree per document.

Fig. 4. Tree-structured topics from The New York Times. The shaded node is the top-level node and lines indicate dependencies
within the tree. In general, topics are learned in increasing levels of specificity. For clarity, we have removed grammatical
variations of the same word, such as "scientist" and "scientists."

Fig. 5. Tree size: The smallest number of nodes containing 90%, 99% and 99.9% of all paths as a function of documents
seen for (a) The New York Times, and (b) Wikipedia.

3) Wikipedia: We find similar results for Wikipedia as for The New York Times. In Figures 6 and 7 we
show results corresponding to Figures 2 and 3 for The New York Times. We again see an improvement
in performance for the nHDP over LDA and the HDP, as well as the increased usage of the tree with
the nHDP than would be available in the nCRP. In Figure 8 we see example subtrees used by three
documents. We note that the topics contain many more function words than for The New York Times, but
an underlying hierarchical structure is uncovered that would be unlikely to arise along one path, as the
nCRP would require. In Figure 5(b) we again show the size of the tree as a function of documents seen by
showing the number of nodes containing 90%, 99% and 99.9% of all paths from the subtrees. As with
The New York Times, the model simplifies significantly from the original 1,220 nodes initialized.

Fig. 6. Wikipedia: Average per-word log likelihood on a held-out test set as a function of training documents seen.

Fig. 7. Wikipedia: Per-document statistics from the test set using the tree at the final step of the algorithm. (a) A histogram
of the size of the subtree selected for a document. (b) The average number of nodes by level within the subtree (white), and
the average number by level that have at least one expected observation (black). (c) The average number of words allocated to
each level of the tree per document.

Fig. 8. Examples of subtrees for three articles from Wikipedia. The three sizes of font differentiate the more probable
topics from the less probable.

VI. Conclusion

We have presented the nested hierarchical Dirichlet process (nHDP), an extension of the nested Chinese
restaurant process (nCRP) that allows each observation to follow its own path to a topic in the tree.
Starting with a stick-breaking construction for the nCRP, the new model samples document-specific path
distributions for a shared tree using a hierarchy of Dirichlet processes. By giving a document access to
the entire tree, we are able to borrow thematic content from various parts of the tree in constructing
a document. In our experiments we showed that this led to a general improvement over the nCRP for
hierarchical topic modeling. In addition, we have developed a stochastic variational inference algorithm
that is scalable to very large data sets. We compared the stochastic nHDP topic model with stochastic LDA
and HDP on large collections from The New York Times and Wikipedia, where we showed an improvement
in predictive ability with our tree-structured prior. Qualitative results on these corpora indicate that the
nHDP can learn meaningful topic hierarchies, and that documents benefit by taking advantage of the
entire tree.



References 

[1] D. Blei, T. Griffiths, and M. Jordan, "The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies," Journal of the ACM, vol. 57, no. 2, pp. 7:1-30, 2010.
[2] C. Wang and D. Blei, "Variational inference for the nested Chinese restaurant process," in Advances in Neural Information Processing Systems, 2009.
[3] J. H. Kim, D. Kim, S. Kim, and A. Oh, "Modeling topic hierarchies with the recursive Chinese restaurant process," in International Conference on Information and Knowledge Management (CIKM), 2012.
[4] Y. Teh, M. Jordan, M. Beal, and D. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566-1581, 2006.
[5] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[6] M. Jordan, "Message from the President: The era of Big Data," ISBA Bulletin, vol. 18, no. 2, pp. 1-3, 2011.
[7] M. Hoffman, D. Blei, C. Wang, and J. Paisley, "Stochastic variational inference," arXiv:1206.7051, 2012.
[8] M. Hoffman, D. Blei, and F. Bach, "Online learning for latent Dirichlet allocation," in Advances in Neural Information Processing Systems, 2010.
[9] C. Wang, J. Paisley, and D. Blei, "Online learning for the hierarchical Dirichlet process," in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 15, 2011, pp. 752-760.
[10] T. Ferguson, "A Bayesian analysis of some nonparametric problems," The Annals of Statistics, vol. 1, pp. 209-230, 1973.
[11] D. Blackwell and J. MacQueen, "Ferguson distributions via Pólya urn schemes," Annals of Statistics, vol. 1, no. 2, pp. 353-355, 1973.
[12] J. Sethuraman, "A constructive definition of Dirichlet priors," Statistica Sinica, vol. 4, pp. 639-650, 1994.
[13] D. Aldous, Exchangeability and Related Topics, ser. École d'Été de Probabilités de Saint-Flour XIII-1983, pp. 1-198. Springer, 1985.
[14] A. Rodriguez, D. Dunson, and A. Gelfand, "The nested Dirichlet process," Journal of the American Statistical Association, vol. 103, pp. 1131-1154, 2008.
[15] L. Ren, L. Carin, and D. Dunson, "The dynamic hierarchical Dirichlet process," in International Conference on Machine Learning, 2008.
[16] E. Airoldi, D. Blei, S. Fienberg, and E. Xing, "Mixed membership stochastic blockmodels," Journal of Machine Learning Research, vol. 9, pp. 1981-2014, 2008.
[17] E. Fox, E. Sudderth, M. Jordan, and A. Willsky, "A sticky HDP-HMM with application to speaker diarization," Annals of Applied Statistics, vol. 5, no. 2A, pp. 1020-1056, 2011.
[18] R. Adams, Z. Ghahramani, and M. Jordan, "Tree-structured stick breaking for hierarchical data," in Advances in Neural Information Processing Systems, 2010.
[19] M. Sato, "Online model selection based on the variational Bayes," Neural Computation, vol. 13, no. 7, pp. 1649-1681, 2001.
[20] J. Paisley, C. Wang, and D. Blei, "The discrete infinite logistic normal distribution," Bayesian Analysis, vol. 7, no. 2, pp. 235-272, 2012.
[21] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, pp. 183-233, 1999.
[22] J. Winn and C. Bishop, "Variational message passing," Journal of Machine Learning Research, vol. 6, pp. 661-694, 2005.
[23] S. Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, no. 2, pp. 251-276, 1998.
[24] D. Blei and M. Jordan, "Variational inference for Dirichlet process mixtures," Bayesian Analysis, vol. 1, no. 1, pp. 121-144, 2005.
[25] K. Kurihara, M. Welling, and N. Vlassis, "Accelerated variational DP mixture models," in Advances in Neural Information Processing Systems 19, 2006, pp. 761-768.
[26] C. Wang and D. Blei, "Truncation-free online variational inference for Bayesian nonparametric models," in Advances in Neural Information Processing Systems, 2012.
[27] Y. Teh, K. Kurihara, and M. Welling, "Collapsed variational inference for HDP," in Advances in Neural Information Processing Systems, 2008.






